Abstract: This paper provides a theoretical explanation of
the clustering aspect of nonnegative matrix factorization (NMF).
We prove that even without imposing orthogonality or sparsity
constraints on the basis and/or coefficient matrix, NMF can still
give clustering results, thus providing theoretical support for
many works, e.g., Xu et al. [1] and Kim et al. [2], that show the
superiority of the standard NMF as a clustering method.

Keywords — bound-constrained optimization, clustering method, 
non-convex optimization, nonnegative matrix factorization 

I. Introduction 

NMF is a matrix approximation technique that factorizes a
nonnegative matrix into a pair of other nonnegative matrices
of much lower rank:

$$A \approx BC, \tag{1}$$

where $A \in \mathbb{R}_+^{M \times N} = [a_1, \ldots, a_N]$ denotes
the feature-by-item data matrix,
$B \in \mathbb{R}_+^{M \times K} = [b_1, \ldots, b_K]$ denotes
the basis matrix,
$C \in \mathbb{R}_+^{K \times N} = [c_1, \ldots, c_N]$ denotes the
coefficient matrix, and $K$ denotes the number of factors, which
is usually chosen so that $K \ll \min(M, N)$. There are also other
variants of NMF like semi-NMF, convex NMF, and symmetric
NMF. Detailed discussions can be found in, e.g., [3] and [4].

The nonnegativity constraints and the reduced dimensionality
define the uniqueness and power of NMF. The nonnegativity
constraints allow only nonsubtractive linear combinations
of the basis vectors $b_k$ to construct the data vectors $a_n$, thus
providing the parts-based interpretations shown in [5], [6],
[7]. The reduced dimensionality provides NMF with the
clustering aspect and data compression capabilities.

The most important application of NMF is data clustering,
as some works have shown that it is superior to standard
clustering methods such as spectral methods and the $K$-means
algorithm. In particular, Xu et al. [1] showed that NMF
outperforms standard spectral methods in finding the document
clustering in two text corpora, TDT2 and Reuters. Kim et al. [2]
showed that NMF and sparse NMF are far superior to the
$K$-means algorithm on both a synthetic dataset (which is well
separated) and a real dataset (TDT2).

If sparsity constraints are imposed on the columns of $C$, the
clustering aspect of NMF is intuitive: in the extreme case
where there is only one nonzero entry per column, NMF is
equivalent to the $K$-means algorithm applied to the data
vectors $a_n$ [8], and the sparsity constraints can be thought of
as a relaxation of the strict orthogonality constraints on the
rows of $C$ (an equivalent explanation can also be stated for
imposing sparsity on the rows of $B$).

However, as reported by Xu et al. [1] and Kim et al. [2],
even without imposing sparsity constraints, NMF can still give
very promising clustering results. But the authors did not give
any theoretical analysis of why the standard NMF (NMF
without sparsity or orthogonality constraints) can give such
good results. So far the best explanation for this remarkable
fact is only qualitative: the standard NMF produces
non-orthogonal latent semantic directions (the basis vectors) that
are more likely to correspond to each of the clusters than those
produced by the spectral methods, thus the clustering induced
from the latent semantic directions of the standard NMF is
better than clustering by the spectral methods [1]. Therefore,
this work attempts to provide theoretical support for the
clustering aspect of the standard NMF.

II. Clustering aspect of NMF 

To compute $B$ and $C$, eq. (1) is usually rewritten as a
minimization problem in the Frobenius norm criterion:

$$\min_{B,C} J(B,C) = \frac{1}{2}\|A - BC\|_F^2, \quad \text{s.t. } B \ge 0,\ C \ge 0. \tag{2}$$



In addition to the usual Frobenius norm criterion, the family of
Bregman divergences, of which the Frobenius norm and the
Kullback-Leibler divergence are members, can also be used as the
affinity measures. A detailed discussion of the Bregman
divergences for NMF can be found in [9].

Sometimes it is more practical and intuitive to decompose
$J(B, C)$ into a series of smaller objectives:

$$\min_{B,C} J(B,C) = \Big( \min_{B,c_1} J_1(B,c_1), \ldots, \min_{B,c_N} J_N(B,c_N) \Big)^T, \tag{3}$$

where

$$\min_{B,c_n} J_n(B,c_n) = \frac{1}{2}\|a_n - Bc_n\|_F^2, \quad n \in [1, N]. \tag{4}$$
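
As a quick numerical illustration of the decomposition in eq. (3) and eq. (4), the following minimal sketch checks that the per-item objectives sum to the full objective of eq. (2). The matrix sizes, the random data, and the helper names `J` and `J_n` are assumptions made for this example only; they do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 6, 5, 2                 # illustrative sizes (assumed)
A = rng.random((M, N))            # nonnegative feature-by-item matrix
B = rng.random((M, K))            # nonnegative basis matrix
C = rng.random((K, N))            # nonnegative coefficient matrix


def J(A, B, C):
    """Full objective of eq. (2): 0.5 * ||A - BC||_F^2."""
    return 0.5 * np.linalg.norm(A - B @ C, "fro") ** 2


def J_n(a_n, B, c_n):
    """Per-item objective of eq. (4): 0.5 * ||a_n - B c_n||^2."""
    return 0.5 * np.sum((a_n - B @ c_n) ** 2)


# One small objective per data vector a_n (column n of A).
per_item = np.array([J_n(A[:, n], B, C[:, n]) for n in range(N)])
total = J(A, B, C)

print(per_item.sum(), total)      # the two values agree up to rounding
```

For a fixed $B$, each `J_n` depends only on its own column $c_n$, which is why the subproblems can be solved independently.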



Minimizing $J_n$ is known as the nonnegative least squares
(NLS) problem, and some fast NMF algorithms are based on
solving the NLS subproblems, e.g., alternating NLS
with the block principal pivoting algorithm [10], the active set
method [11], and the projected quasi-Newton algorithm [12].
Decomposing the NMF problem into NLS subproblems also transforms
the non-convex optimization in eq. (3) into the convex optimization
subproblems in eq. (4). Even though eq. (4) is not strictly convex,
for the two-block case any limit point of the sequence
$\{B^t, C^t\}$, where $t$ is the updating step, is a stationary
point [13].

The objective in eq. (4) aims to simultaneously find
suitable basis vectors such that the latent factors are revealed,
and the coefficient vector $c_n$ such that a linear combination of
the basis vectors ($Bc_n$) is close to $a_n$. In clustering terms this
can be rephrased as: to simultaneously find the cluster centers
and the cluster assignments.
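
The "cluster centers and cluster assignments" reading can be sketched in code. The example below is only an illustration under stated assumptions: it uses the well-known multiplicative update rules for the Frobenius objective as a stand-in solver (not one of the NLS-based algorithms cited above), a small synthetic data matrix, and an argmax over each column of $C$ as the assignment rule. The function name `nmf_multiplicative` and all parameter values are hypothetical.

```python
import numpy as np


def nmf_multiplicative(A, K, n_iter=200, eps=1e-9, seed=0):
    """Standard NMF via multiplicative updates (an assumed stand-in solver)."""
    rng = np.random.default_rng(seed)
    M, N = A.shape
    B = rng.random((M, K)) + eps          # columns of B act as cluster centers
    C = rng.random((K, N)) + eps          # columns of C act as soft assignments
    history = []
    for _ in range(n_iter):
        C *= (B.T @ A) / (B.T @ B @ C + eps)      # update coefficients
        B *= (A @ C.T) / (B @ C @ C.T + eps)      # update basis vectors
        history.append(0.5 * np.linalg.norm(A - B @ C, "fro") ** 2)
    return B, C, history


# Synthetic data: 4 features, 6 items drawn around two separated centers.
rng = np.random.default_rng(1)
centers = np.array([[5.0, 5.0, 0.1, 0.1],
                    [0.1, 0.1, 5.0, 5.0]]).T      # shape (4, 2)
truth = np.array([0, 0, 0, 1, 1, 1])
A = centers[:, truth] + 0.05 * rng.random((4, 6))

B, C, history = nmf_multiplicative(A, K=2)
labels = C.argmax(axis=0)     # item n goes to the factor with the largest c_n entry
print(labels)
```

Because the factor order is arbitrary, the labels should be compared with a ground truth only up to a permutation of the cluster indices.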

To investigate the clustering aspect of NMF, four possible
NMF settings are discussed: (1) imposing orthogonality
constraints on both rows of $C$ and columns of $B$, (2) imposing
orthogonality constraints on rows of $C$, (3) imposing
orthogonality constraints on columns of $B$, and (4) imposing no
orthogonality constraint. The last case is the standard NMF, whose
clustering aspect is the focus of this paper, as many works have
reported that it is a very effective clustering method.

A. Orthogonality constraints on both B and C 

The following theorem proves that imposing
column-orthogonality constraints on $B$ and row-orthogonality
constraints on $C$ leads to the simultaneous clustering of similar
items and related features.

Theorem 1. Minimizing the following objective

$$\min_{B,C} J_a(B,C) = \frac{1}{2}\|A - BC\|_F^2 \quad \text{s.t. } B \ge 0,\ C \ge 0,\ B^TB = I,\ CC^T = I \tag{5}$$

is equivalent to applying ratio association to $\mathcal{G}(A^TA)$ and
$\mathcal{G}(AA^T)$, where $A^TA$ and $AA^T$ are the item affinity matrix
and the feature affinity matrix respectively, and thus leads to
simultaneous clustering of similar items and related features.

Proof:

$$\|A - BC\|_F^2 = \operatorname{tr}\big((A - BC)^T(A - BC)\big) = \operatorname{tr}\big(A^TA - 2C^TB^TA + I\big). \tag{6}$$

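The Lagrangian function:

$$L_a(B,C) = J_a(B,C) - \operatorname{tr}(\Gamma_B B^T) - \operatorname{tr}(\Gamma_C C) + \operatorname{tr}\big(\Lambda_B(B^TB - I)\big) + \operatorname{tr}\big(\Lambda_C(CC^T - I)\big), \tag{7}$$

where $\Gamma_B \in \mathbb{R}_+^{M \times K}$, $\Gamma_C \in \mathbb{R}_+^{N \times K}$,
$\Lambda_B \in \mathbb{R}_+^{K \times K}$, and $\Lambda_C \in \mathbb{R}_+^{K \times K}$
are the Lagrange multipliers. By the Karush-Kuhn-Tucker (KKT)
optimality conditions we get: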



$$\nabla_B L_a = B - AC^T - \Gamma_B + 2B\Lambda_B = 0, \tag{8}$$

$$\nabla_C L_a = C - B^TA - \Gamma_C^T + 2\Lambda_C C = 0, \tag{9}$$

with complementary slackness:

$$\Gamma_B \odot B = 0, \quad \Gamma_C^T \odot C = 0, \tag{10}$$

where $\odot$ denotes component-wise multiplication. Assume
$\Gamma_B = 0$, $\Lambda_B = 0$, $\Gamma_C = 0$, and $\Lambda_C = 0$ (at the
stationary point these assumptions are reasonable since the
complementary slackness conditions hold and the Lagrange
multipliers can be assigned to zeros). We then get:

$$B = AC^T \quad \text{and} \tag{11}$$

$$C = B^TA. \tag{12}$$



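Substituting eq. (11) into eq. (6) we get:

$$\min_C J_a(C) = \max_C \operatorname{tr}\big(CA^TAC^T\big). \tag{13}$$

Similarly, substituting eq. (12) into eq. (6) we get:

$$\min_B J_a(B) = \max_B \operatorname{tr}\big(B^TAA^TB\big). \tag{14}$$

Therefore, minimizing $J_a$ is equivalent to simultaneously
optimizing:

$$\max_C \operatorname{tr}\big(CA^TAC^T\big) \quad \text{s.t. } CC^T = I, \quad \text{and} \tag{15}$$

$$\max_B \operatorname{tr}\big(B^TAA^TB\big) \quad \text{s.t. } B^TB = I. \tag{16}$$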



Eq. (15) and eq. (16) are the ratio association objectives (see
[14] for details on various graph cut objectives) applied
to $\mathcal{G}(A^TA)$ and $\mathcal{G}(AA^T)$ respectively. Thus minimizing $J_a$
leads to the simultaneous clustering of similar items and
related features. ■
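
The link between the trace objectives and ratio association can be checked numerically. The sketch below is an illustration under assumptions, not part of the proof: it builds $B$ and $C$ as normalized cluster indicator matrices (one nonzero per row of $B$ and per column of $C$, which makes them nonnegative with $B^TB = I$ and $CC^T = I$), verifies the expansion in eq. (6), and confirms that $\operatorname{tr}(CA^TAC^T)$ equals the sum of within-cluster affinities divided by cluster sizes. The labels, sizes, and helper names are hypothetical.

```python
import numpy as np


def normalized_indicator(labels, K):
    """K x n matrix whose row k is the membership of cluster k over sqrt(size)."""
    Z = np.zeros((K, labels.size))
    Z[labels, np.arange(labels.size)] = 1.0
    return Z / np.sqrt(Z.sum(axis=1, keepdims=True))


def ratio_association(W, labels, K):
    """Sum over clusters of (within-cluster affinity) / (cluster size)."""
    value = 0.0
    for k in range(K):
        mask = labels == k
        value += W[mask][:, mask].sum() / mask.sum()
    return value


rng = np.random.default_rng(0)
M, N, K = 6, 8, 2                              # illustrative sizes (assumed)
A = rng.random((M, N))
item_labels = np.array([0, 0, 0, 1, 1, 1, 1, 0])
feature_labels = np.array([0, 1, 0, 1, 0, 1])

C = normalized_indicator(item_labels, K)       # K x N, rows orthonormal
B = normalized_indicator(feature_labels, K).T  # M x K, columns orthonormal

# Eq. (6): ||A - BC||_F^2 = tr(A^T A) - 2 tr(C^T B^T A) + tr(I), with tr(I) = K.
lhs = np.linalg.norm(A - B @ C, "fro") ** 2
rhs = np.trace(A.T @ A) - 2.0 * np.trace(C.T @ B.T @ A) + K

# Eq. (15) evaluated at an indicator C is the ratio association on A^T A.
W_items = A.T @ A
trace_objective = np.trace(C @ W_items @ C.T)
ratio_value = ratio_association(W_items, item_labels, K)

print(lhs, rhs)
print(trace_objective, ratio_value)
```

The same check applies to eq. (16) by replacing `W_items` with `A @ A.T` and using the feature labels.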

B. Orthogonality constraints on C 

When the orthogonality constraints are imposed only on the
rows of $C$, it is no longer clear whether the columns of $B$ will
lead to the feature clustering. The following theorem shows
that without imposing the orthogonality constraints on $b_k$, the
resulting $B$ can still lead to the feature clustering.

Theorem 2. Minimizing the following objective

$$\min_{B,C} J_b(B,C) = \frac{1}{2}\|A - BC\|_F^2 \quad \text{s.t. } B \ge 0,\ C \ge 0,\ CC^T = I \tag{17}$$

is equivalent to applying ratio association to $\mathcal{G}(A^TA)$, and
also leads to the feature clustering indicator matrix $B$ which
is approximately column-orthogonal.

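Proof:

$$\|A - BC\|_F^2 = \operatorname{tr}\big((A - BC)^T(A - BC)\big) = \operatorname{tr}\big(A^TA - 2B^TAC^T + C^TB^TBC\big). \tag{18}$$

The Lagrangian function:

$$L_b(B,C) = J_b(B,C) - \operatorname{tr}(\Gamma_B B^T) - \operatorname{tr}(\Gamma_C C) + \operatorname{tr}\big(\Lambda_C(CC^T - I)\big). \tag{19}$$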



By applying the KKT conditions, we get:

$$B = AC^T \quad \text{and} \tag{20}$$

$$C = B^TA. \tag{21}$$



By substituting eq. (20) and eq. (21) into eq. (18), minimizing $J_b$
is equivalent to simultaneously optimizing:

$$\max_C \operatorname{tr}\big(CA^TAC^T\big) \quad \text{s.t. } CC^T = I, \tag{22}$$

$$\max_B \operatorname{tr}\big(B^TAA^TB\big), \quad \text{and} \tag{23}$$

$$\min_B \operatorname{tr}\big(A^TBB^TBB^TA\big) = \min_B \operatorname{tr}\big(B^TBB^TB\big). \tag{24}$$

Note that the step in eq. (24) is justifiable since $A$ is a constant
matrix. By using the fact $\operatorname{tr}(X^TX) = \|X\|_F^2$, eq. (24) can be
rewritten as:

$$\min_B \|B^TB\|_F^2 = \min_B \Big( \sum_i (b_i^Tb_i)^2 + \sum_{i \neq j} (b_i^Tb_j)^2 \Big). \tag{25}$$

The objective in eq. (22) is equivalent to eq. (15) and eventually
leads to the clustering of similar items. So the remaining
problem is how to prove that optimizing eq. (23) and eq. (25)
simultaneously will lead to the feature clustering indicator
matrix $B$ which is approximately column-orthogonal.

Eq. (23) resembles eq. (14) but without orthogonality or
upper bound constraints, so one can easily optimize eq. (23) by
setting $B$ to an infinity matrix. However, this violates eq. (25),
which favors small $B$. Conversely, one can optimize eq. (25)
by setting $B$ to a null matrix, but again this violates eq. (23).
Therefore, these two objectives create implicit lower and upper
bound constraints on $B$, and eq. (23) and eq. (25) can be rewritten
into:



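$$\max_B \operatorname{tr}\big(B^TAA^TB\big), \quad \text{and} \tag{26}$$

$$\min_B \Big( \underbrace{\sum_i (b_i^Tb_i)^2}_{j_{b1}} + \underbrace{\sum_{i \neq j} (b_i^Tb_j)^2}_{j_{b2}} \Big) \quad \text{s.t. } 0 \le B \le \Upsilon_B, \tag{27}$$

where $AA^T$ is the feature affinity matrix and $\Upsilon_B$ denotes
the upper bound constraints on $B$. Now we have box-constrained
objectives, which are known to behave well and are guaranteed
to converge to a stationary point [15].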

Even though the objectives are now transformed into
box-constrained optimization problems, since there is no
column-orthogonality constraint, maximizing eq. (26) can be easily done
by setting each entry of $B$ to the corresponding largest possible
value (in graph terms this means creating only one partition
on $\mathcal{G}(AA^T)$). But this scenario results in the maximum value of
eq. (27), which violates the objective. Conversely, minimizing
eq. (27) to the smallest possible value (minimizing $j_{b1}$ implies
minimizing $j_{b2}$, but not vice versa) violates eq. (26).

Thus, the most reasonable scenario is: setting $j_{b2}$ as small
as possible and balancing $j_{b1}$ with eq. (26). This scenario is
the relaxed ratio association applied to $\mathcal{G}(AA^T)$, and as long as
vertices of $\mathcal{G}(AA^T)$ are clustered, simultaneously optimizing
eq. (26) and eq. (27) leads to the clustering of related features.
Moreover, as $j_{b2}$ is minimum, $B$ is approximately
column-orthogonal. ■
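
The two terms $j_{b1}$ and $j_{b2}$ can be computed directly from the Gram matrix $B^TB$. The sketch below is illustrative only: the helper name `orthogonality_terms` and the two toy matrices are assumptions chosen to contrast a basis with fully overlapping columns against one with disjoint supports.

```python
import numpy as np


def orthogonality_terms(B):
    """Split ||B^T B||_F^2 into diagonal (j_b1) and off-diagonal (j_b2) parts."""
    G = B.T @ B                          # Gram matrix with entries b_i^T b_j
    j_b1 = np.sum(np.diag(G) ** 2)       # sum_i (b_i^T b_i)^2
    j_b2 = np.sum(G ** 2) - j_b1         # sum_{i != j} (b_i^T b_j)^2
    return j_b1, j_b2


# Every feature loads on both factors: columns are far from orthogonal.
B_overlap = np.ones((4, 2))

# Each feature loads on exactly one factor: columns are orthogonal.
B_disjoint = np.array([[1.0, 0.0],
                       [1.0, 0.0],
                       [0.0, 1.0],
                       [0.0, 1.0]])

j1_overlap, j2_overlap = orthogonality_terms(B_overlap)
j1_disjoint, j2_disjoint = orthogonality_terms(B_disjoint)

print(j1_overlap, j2_overlap)      # 32.0 32.0
print(j1_disjoint, j2_disjoint)    # 8.0 0.0
```

For a nonnegative $B$, $j_{b2} = 0$ holds exactly when no row has nonzero entries in two different columns, which is the hard feature-clustering case; a small but nonzero $j_{b2}$ corresponds to the approximately column-orthogonal $B$ described in the proof.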



C. Orthogonality constraints on B 

Theorem 3. Minimizing the following objective

$$\min_{B,C} J_c(B,C) = \frac{1}{2}\|A - BC\|_F^2 \quad \text{s.t. } B \ge 0,\ C \ge 0,\ B^TB = I \tag{28}$$

is equivalent to applying ratio association to $\mathcal{G}(AA^T)$, and
also leads to the item clustering indicator matrix $C$ which is
approximately row-orthogonal.

Proof: By following the proof of Theorem 2, minimizing
$J_c$ is equivalent to simultaneously optimizing:

$$\max_B \operatorname{tr}\big(B^TAA^TB\big) \quad \text{s.t. } B^TB = I, \tag{29}$$

$$\max_C \operatorname{tr}\big(CA^TAC^T\big), \quad \text{and} \tag{30}$$

$$\min_C \operatorname{tr}\big(C^TCA^TAC^TC\big) = \min_C \operatorname{tr}\big(CC^TCC^T\big). \tag{31}$$

Eq. (29) is equivalent to eq. (16) and leads to the clustering of
related features. Optimizing eq. (30) and eq. (31) simultaneously
is equivalent to:



$$\max_C \operatorname{tr}\big(CA^TAC^T\big), \quad \text{and} \tag{32}$$

$$\min_C \Big( \underbrace{\sum_i (c_ic_i^T)^2}_{j_{c1}} + \underbrace{\sum_{i \neq j} (c_ic_j^T)^2}_{j_{c2}} \Big) \quad \text{s.t. } 0 \le C \le \Upsilon_C, \tag{33}$$

where $A^TA$ is the item affinity matrix, $c_i$ denotes the $i$-th
row of $C$, and $\Upsilon_C$ denotes the upper bound constraints on $C$.
As in the proof of Theorem 2, the most reasonable scenario
in simultaneously optimizing eq. (32) and eq. (33) is setting
$j_{c2}$ as small as possible and balancing $j_{c1}$ with eq. (32). This
leads to the clustering of similar items, and as $j_{c2}$ is minimum,
$C$ is approximately row-orthogonal. ■

D. No orthogonality constraint on both B and C 

In this section we prove that applying the standard NMF 
to the feature-by-item data matrix eventually leads to the 
simultaneous feature and item clustering. 

Theorem 4. Minimizing the following objective

$$\min_{B,C} J_d(B,C) = \frac{1}{2}\|A - BC\|_F^2 \quad \text{s.t. } B \ge 0,\ C \ge 0 \tag{34}$$

leads to the feature clustering indicator matrix $B$ and the
item clustering indicator matrix $C$ which are approximately
column- and row-orthogonal respectively.

Proof: By following the proof of Theorem 2, minimizing
$J_d$ is equivalent to simultaneously optimizing:

$$\max_{B,C} \operatorname{tr}\big(B^TAC^T\big), \quad \text{and} \tag{35}$$

$$\min_{B,C} \operatorname{tr}\big(B^TBCC^T\big). \tag{36}$$



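By substituting $B = AC^T$ and $C = B^TA$ into the above
equations, we get:

$$\max_B \operatorname{tr}\big(B^TAA^TB\big), \quad \text{and} \tag{37}$$

$$\min_B \operatorname{tr}\big(B^TBB^TAA^TB\big) = \min_B \operatorname{tr}\big(B^TBB^TB\big) \tag{38}$$

for feature clustering, and:

$$\max_C \operatorname{tr}\big(CA^TAC^T\big), \quad \text{and} \tag{39}$$

$$\min_C \operatorname{tr}\big(CA^TAC^TCC^T\big) = \min_C \operatorname{tr}\big(CC^TCC^T\big) \tag{40}$$

for item clustering. Therefore, minimizing $J_d$ is equivalent to
simultaneously optimizing: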
simultaneously optimizing: 



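$$\max_B \operatorname{tr}\big(B^TAA^TB\big), \tag{41}$$

$$\min_B \Big( \sum_i (b_i^Tb_i)^2 + \sum_{i \neq j} (b_i^Tb_j)^2 \Big), \tag{42}$$

$$\max_C \operatorname{tr}\big(CA^TAC^T\big), \quad \text{and} \tag{43}$$

$$\min_C \Big( \sum_i (c_ic_i^T)^2 + \sum_{i \neq j} (c_ic_j^T)^2 \Big), \tag{44}$$

s.t. $0 \le B \le \Upsilon_B$ and $0 \le C \le \Upsilon_C$,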

which will lead to the feature clustering indicator matrix B and 
the item clustering indicator matrix C that are approximately 
column- and row-orthogonal respectively. ■ 

III. Unipartite and directed graph cases 

The affinity matrix $W$ induced from a unipartite
(undirected) graph is a symmetric matrix, which is a special case
of the rectangular affinity matrix $A$. Therefore, by following
the discussion in section II it can be shown that the standard
NMF applied to $W$ leads to a clustering indicator matrix
which is almost orthogonal.

The affinity matrix $V$ induced from a directed graph is
an asymmetric square matrix. Since the columns and rows of $V$
correspond to the same set of vertices in the same order,
as far as the clustering problem is concerned, $V$ can be replaced
by $V + V^T$, which is a symmetric matrix. Then the standard NMF
can be applied to this matrix to get a clustering indicator
matrix which is almost orthogonal.
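
The symmetrization step for the directed case is small enough to show directly. This is a minimal sketch with a made-up three-vertex affinity matrix; the values are assumptions chosen only to make the effect visible.

```python
import numpy as np

# Hypothetical directed-graph affinity: V[i, j] is the weight of edge i -> j.
V = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 2.0],
              [3.0, 0.0, 0.0]])

# Replace V by V + V^T so that rows and columns describe the same
# undirected affinity between vertices.
W = V + V.T

print(W)
```

The resulting `W` keeps the nonnegativity of `V`, so the standard NMF can be applied to it without further changes.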

IV. Related works 

Ding et al. [8] provide a theoretical analysis of the
equivalence between orthogonal NMF and $K$-means clustering
for both rectangular data matrices and symmetric matrices.
However, as their proofs utilize the zero gradient conditions,
the hidden assumptions (setting the Lagrange multipliers to
zeros) are not revealed there. In fact, it can be easily shown
that their approach is the KKT conditions applied to the
unconstrained version of eq. (2). Thus there is no guarantee that
minimizing eq. (2) by using the zero gradient conditions leads
to a stationary point located in the nonnegative orthant as
required by the objective.

That applying the standard NMF to a symmetric matrix leads
to an almost orthogonal matrix was previously proven by Ding
et al. [16]. But due to the approach used, the theorem cannot
be extended to rectangular matrices, which so far are
the usual form of the data (practical applications of NMF
seem to be exclusively for rectangular matrices). Therefore, their
results cannot be used to explain the abundant experimental
results that show the power of the standard NMF in clustering,
latent factor identification, learning the parts of objects,
and producing sparse matrices even without explicit sparsity
constraints [5].

V. Conclusion 

By using the strict KKT optimality conditions, we showed
that even without explicitly imposing orthogonality or
sparsity constraints, NMF produces an approximately
column-orthogonal basis matrix and a row-orthogonal coefficient matrix,
which lead to simultaneous feature and item clustering.
This result therefore gives a theoretical explanation of
experimental results that show the power of the standard
NMF as a clustering tool, which is reported to be better than
the spectral methods [1] and the $K$-means algorithm [2].